This document is intended to give an overview of the response distributions from our pilot.
The data is already anonymous and in a tidy format at this stage in the analysis pipeline. We just need to read it in and do some preprocessing.
# read in data
full_df <- read_csv("pilot-anonymous.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## workerId = col_character(),
## batch = col_integer(),
## condition = col_character(),
## start_gain_frame = col_character(),
## numeracy = col_integer(),
## gender = col_character(),
## age = col_character(),
## education = col_character(),
## chart_use = col_character(),
## intervene = col_integer(),
## outcome = col_character(),
## pSup = col_integer(),
## trial = col_character(),
## trialIdx = col_character()
## )
## See spec(...) for full column specifications.
# preprocessing
responses_df <- full_df %>%
rename( # rename to convert away from camel case
worker_id = workerId,
account_value = accountValue,
ground_truth = groundTruth,
p_award_with = pAwardWith,
p_award_without = pAwardWithout,
p_superiority = pSup,
start_time = startTime,
resp_time = respTime,
trial_dur = trialDur,
trial_idx = trialIdx
) %>%
filter(trial_idx != "practice", trial_idx != "mock") %>% # remove practice and mock trials from responses dataframe, leave in full version
mutate( # flip probability of superiority responses below 50% for the loss frame
p_superiority = if_else(ground_truth < 0.5,
100 - p_superiority,
as.numeric(p_superiority)) # hack to avoid type error
) %>%
mutate( # mutate to jitter probability of superiority away from boundaries
p_superiority = ifelse(p_superiority == 0, 0.25, p_superiority), # avoid responses equal to zero
p_superiority = ifelse(p_superiority == 100, 99.75, p_superiority) # avoid responses equal to one-hundred
) %>%
mutate( # mutate to rows where intervene == -1 for some reason
intervene = if_else(intervene == -1,
# repair
if_else((payoff == (award_value - 1) | payoff == (-award_value - 1) | payoff == -1),
1, # payed for intervention
0), # didn't pay for intervention
# don't repair
as.numeric(intervene) # hack to avoid type error
)
)
head(responses_df)
## # A tibble: 6 x 29
## worker_id batch condition baseline award_value exchange start_gain_frame
## <chr> <int> <chr> <dbl> <dbl> <dbl> <chr>
## 1 bf797261 4 means_on… 0.5 2.25 0.5 True
## 2 bf797261 4 means_on… 0.5 2.25 0.5 True
## 3 bf797261 4 means_on… 0.5 2.25 0.5 True
## 4 bf797261 4 means_on… 0.5 2.25 0.5 True
## 5 bf797261 4 means_on… 0.5 2.25 0.5 True
## 6 bf797261 4 means_on… 0.5 2.25 0.5 True
## # … with 22 more variables: total_bonus <dbl>, duration <dbl>,
## # numeracy <int>, gender <chr>, age <chr>, education <chr>,
## # chart_use <chr>, account_value <dbl>, ground_truth <dbl>,
## # intervene <dbl>, outcome <chr>, pAwardCurrent <dbl>, pAwardNew <dbl>,
## # p_award_with <dbl>, p_award_without <dbl>, p_superiority <dbl>,
## # payoff <dbl>, resp_time <dbl>, start_time <dbl>, trial <chr>,
## # trial_dur <dbl>, trial_idx <chr>
Let’s plot histograms of probability of superiority judgments at each level of the ground truth probability of superiority. We show the ground truth in red. This will give us an overview of bias and precision in judgments. We do this separately for each visualization condition to limit the number of faceted subplots in a single view.
for (cond in unique(responses_df$condition)) {
plt <- responses_df %>% filter(condition == cond) %>%
ggplot(aes(x = p_superiority)) +
geom_histogram(aes(y = ..density..), binwidth = 5) +
geom_vline(aes(xintercept = ground_truth * 100, linetype = "Ground Truth"), color = "red") +
scale_linetype_manual(name = "Line", values = c(2,1), guide=guide_legend(override.aes = list(color = c("red")))) +
theme_bw() +
labs(
caption=cond,
x = "Probability of Superiority Responses",
y = "Frequency"
) +
facet_wrap( ~ ground_truth)
print(plt)
}
As we would expect based on a linear log odds representation of probability, probability of superiority judgments tend to be biased toward 50% relative to the ground truth.
Another more compact way of looking at the relationship between estimated probability of superiority and the ground truth is to just plot them against one another. Let’s look at this even though its sort of a mess.
# plot estimated probability of superiority vs the ground truth
responses_df %>%
ggplot(aes(x = ground_truth, y = p_superiority)) +
geom_point(alpha = 0.35) +
geom_abline(aes(intercept = 0, slope = 100, linetype = "Correct Response"), color = "red") +
scale_linetype_manual(name = "Line", values = c(2,1), guide=guide_legend(override.aes = list(color = c("red")))) +
theme_bw() +
labs(
x = "Ground Truth",
y = "Probability of Superiority Responses"
) +
facet_grid(condition ~ baseline)
We can see that constraining responses to a scale from 50 to 100 has produced an artificial bound on variability. This might be an issue when it comes to modeling.
In order to see how people are doing on the decision task, we want to benchmark their performance against a utility optimal decision rule. The rule is different depending on whether the task is framed as a gain or a loss (i.e., whether the ground truth probability of superiority is greater than or less than 50%).
# determine whether or not intervention is utility optimal on each trial
responses_df <- responses_df %>%
mutate(should_intervene = if_else(ground_truth > 0.5,
(p_award_with - p_award_without) > 1 / award_value, # gain framing decision rule
((1 - p_award_without) - (1 - p_award_with) ) > 1 / award_value) # loss framing decision rule
)
Let’s plot the proportion of users who intervene at each level of ground truth probability of superiority in each visualization condition. People should intervene more often at extreme probabilities. We show the utility optimal decision threshold in red. This should give us an overview of decision quality.
# summarise the data as the overall proportion of trials where users intervene vs what they should do at each level of ground_truth * condition * baseline
responses_df %>%
group_by(condition, ground_truth) %>%
summarise(
proportion_intervene = sum(intervene) / n(),
optimal_decision = mean(should_intervene)
) %>%
ggplot(aes(x = ground_truth, y = proportion_intervene)) +
geom_point(alpha = 0.35) +
geom_line(aes(y = optimal_decision, linetype="Optimal Decision Rule"), color="red") +
scale_linetype_manual(name="Line", values = c(2,1), guide=guide_legend(override.aes=list(color=c("red")))) +
theme_bw() +
labs(
x = "Ground Truth Probability of Superiority",
y = "Proportion of Trials Where Users Intervene"
) +
facet_grid(. ~ condition)
In the aggregate, differences between conditions are pretty subtle. It looks like there may be a slight discrepancy in performance between the gain and loss framing trials (i.e., above and below a ground truth of 50%). We’ll need to tease these effects out using statistical inference.
It might also be interesting to see how decisions corresond to probability of superiority judgments. We omit the ground truth and optimal decision rule from this chart.
# summarise the data as the overall proportion of trials where users choose to intervene at each level of condition * baseline * p_superiority
responses_df %>%
group_by(condition, p_superiority) %>%
summarise(proportion_intervene = sum(intervene) / n()) %>%
ggplot(aes(x = p_superiority, y = proportion_intervene)) +
geom_point(alpha = 0.35) +
theme_bw() +
labs(
caption=cond,
x = "Estimated Probability of Superiority",
y = "Proportion of Trials Where Users Choose to Intervene"
) +
facet_grid(. ~ condition)
People’s probability of superiority judgments and decisions are correlated in the way that you would expect if they understood the task.
We want to know when, if at all, spending more time on a response results in improved performance.
Let’s look at probability of superiority estimates as a function of trial duration. As before, we show the ground truth in red and separate visualization conditions into different views to limit the number of faceted subplots in a single view.
for (cond in unique(responses_df$condition)) {
plt <- responses_df %>% filter(condition == cond) %>%
ggplot(aes(x = log(trial_dur), y = p_superiority)) +
geom_hline(aes(yintercept = ground_truth * 100, linetype = "Ground Truth"), color = "red") +
scale_linetype_manual(name = "Line", values = c(2,1), guide=guide_legend(override.aes = list(color = c("red")))) +
geom_point(alpha = 0.35) +
theme_bw() +
labs(
caption=cond,
x = "Trial Duration in log(seconds)",
y = "Estimated Probability of Superiority"
) +
facet_wrap( ~ ground_truth)
print(plt)
}
Trial duration seems mostly unrelated to probability of superiority judgments overall.
A nice metric for decision quality is whether users responded “correctly” or in line with the normative utility optimal decision rule. We calculate whether the user was “correct”" or not on each trial.
# determine whether response on each trial is utility optimal
responses_df <- responses_df %>%
mutate(correct = intervene == should_intervene)
Let’s look at the proportion correct as a function of trial duration, faceting visualization conditions as above.
# summarise the data as the overall proportion of trials where users make utility optimal decisions at each level of condition * baseline * trial_dur
responses_df %>%
mutate(trial_dur_binned = round(trial_dur)) %>%
group_by(condition, baseline, trial_dur_binned) %>%
summarise(proportion_correct = sum(correct) / n()) %>%
ggplot(aes(x = log(trial_dur_binned), y = proportion_correct)) +
geom_point(alpha = 0.35) +
theme_bw() +
labs(
caption=cond,
x = "Binned Trial Duration in log(seconds)",
y = "Proportion of Trials Where Users Make Utility Optimal Decisions"
) +
facet_grid(. ~ condition)
Trial duration seems to have little to do with decision quality.
In this section, we look for patterns of interest in response errors. We’ll start by adding error and absolute error in probability of superiority judgments to the dataframe. We already have a metric for correctness of decisions.
# add error and absolute error to df
responses_df <- responses_df %>%
mutate(
err_p_sup = ground_truth * 100 - p_superiority,
abs_err_p_sup = abs(err_p_sup)
)
Let’s look at the average absolute error in probability of superiority judgments in each condition, regardless of the ground truth.
# avg absolute error per condition
responses_df %>%
group_by(condition, baseline) %>%
summarise(avg_abs_err_p_sup = mean(abs_err_p_sup)) %>%
ggplot(aes(x = condition, y = avg_abs_err_p_sup, fill = condition)) +
geom_bar(stat = "identity") +
theme_bw() +
labs(title = "Average Absolute Error Relative to Ground Truth",
x = "Visualization Condition",
y = "Average Absolute Error (in years out of 100)"
)
On average, errors in probability of superiority judgments are high across the board, with the average error equal to about a third the range of possible responses. Error rates are noteably high in the intervals with means condition.
Let’s look at the average signed error in probability of superiority judgments. This time we’ll plot error in each condition in relation to ground truth.
# error by ground truth, per condition
responses_df %>%
group_by(ground_truth, condition, baseline) %>%
summarise(avg_err_p_sup = mean(err_p_sup)) %>%
ggplot(aes(x = ground_truth, y = avg_err_p_sup, color = condition)) +
geom_line() +
theme_bw() +
labs(title = "Average Error Relative to Ground Truth",
x = "Ground Truth Probability of Superiority",
y = "Average Error (out of 100)"
)
Again, we can see that errors are large on average, especially at the extreme ends of the probability scale. Higher errors at the extremes of the probability scale are expected based on the cental tendency of judgments.
Let’s take a similar approach to visualizing decisions by looking at the proportion of utility optimal decisions as a function of ground truth probability of superiority and condition.
# error by ground truth, per condition
responses_df %>%
group_by(ground_truth, condition, baseline) %>%
summarise(proportion_correct = sum(correct) / n()) %>%
ggplot(aes(x = ground_truth, y = proportion_correct, color = condition)) +
geom_line() +
theme_bw() +
labs(title = "Proportion of Utility Optimal Intervention Decisions",
x = "Ground Truth Probability of Superiority",
y = "Proportion of Utility Optimal Decisions"
)
We can see that there are dips in performance near 25% and 75% probability of superiority. This is expected considering that these are the charts for which the intervention decision is most ambiguous. Interestingly, the parts of the probability scale which are perceived most accurately are the same parts where people make the most utility optimal decisions. It’s unclear whether this is just a consequence of the task structure or also a feature of perceptual reasoning to support decisions.
We should check whether responses are biased by the framing of the problem either as a potential gain when probability of superiority is greater than 50% or as a potential loss when probability of superiority is less than 50%.
Let’s start by looking at signed errors in probability of superiority estimates. We facet out visualization conditions in this view to make it easier to detect asymmetries between framing conditions.
# reflect error where probability of superiority < 50% onto range between 0.5 and 1
responses_df %>%
mutate(
ground_truth_50_100 = ifelse(ground_truth < 0.5, 1 - ground_truth, ground_truth),
framing = ifelse(ground_truth > 0.5, "Gain", "Loss")
) %>%
group_by(ground_truth_50_100, condition, baseline, framing) %>%
summarise(avg_err_p_sup = mean(err_p_sup)) %>%
ggplot(aes(x = ground_truth_50_100, y = avg_err_p_sup, color = condition)) +
geom_line(aes(linetype = framing)) +
theme_bw() +
labs(title = "Framing Effects on Average Error in Estimated Probability of Superiority",
x = "Probability that Team is Better with New Player",
y = "Average Error in Estimated Probability of Superiority"
) +
facet_grid(. ~ condition)
Asymmetries do not seem substantial.
Next, let’s use a similar visualization to investigate framing bias in the proportion of utility optimal decisions.
# reflect error where Pr(A > B) < 0.5 onto range between 0.5 and 1
responses_df %>%
mutate(
ground_truth_50_100 = ifelse(ground_truth < 0.5, 1 - ground_truth, ground_truth),
framing = ifelse(ground_truth > 0.5, "Gain", "Loss")
) %>%
group_by(ground_truth_50_100, condition, baseline, framing) %>%
summarise(proportion_correct = sum(correct) / n()) %>%
ggplot(aes(x = ground_truth_50_100, y = proportion_correct, color = condition)) +
geom_line(aes(linetype = framing)) +
theme_bw() +
labs(title = "Framing Effects on Proportion of Utility Optimal Decisions",
x = "Probability that Team is Better with New Player",
y = "Proportion of Decisions that Are Utility Optimal"
) +
facet_grid(. ~ condition)
It looks like there may be a reversal where users make better decisions in the gain frame at high levels of ground truth probability of superiority, but otherwise users make utility optimal decisions at least as often in the loss frame. This pattern is clearest in the HOPs and intervals_w_means conditions. However, sampling error makes it difficult to see.
As is often the case with judgments from visualizations, the data seem highly heterogenious. We try to get a sense of this by looking at individual patterns of responses in conjunction with individual characteristics such as gender, age, education, chart use, and numeracy. Below we create an overview of performance and individual characteristics for each participant separately.
for (worker in unique(responses_df$worker_id)) {
# get a df for just this worker
worker_df <- responses_df %>% filter(worker_id == worker)
# plot probability of superiority judgments vs ground truth
p_sup_plt <- worker_df %>%
ggplot(aes(x = ground_truth, y = p_superiority)) +
geom_point(alpha = 0.35) +
geom_abline(aes(intercept = 0, slope = 100, linetype = "Correct Response"), color = "red") +
scale_linetype_manual(name = "Line", values = c(2,1), guide=guide_legend(override.aes = list(color = c("red")))) +
theme_bw() +
ylim(0, 100) +
labs(
x = "Ground Truth",
y = "Estimated Probability of Superiority"
)
# plot intervention decisions vs ground truth, noting which are in line with the utility optimal decision rule
decision_plt <- worker_df %>%
ggplot(aes(x = ground_truth, y = intervene, color = correct)) +
geom_point(alpha = 0.35) +
geom_line(aes(y = as.numeric(should_intervene), linetype="Optimal Decision Rule"), color="red") +
scale_linetype_manual(name="Line", values = c(2,1), guide=guide_legend(override.aes=list(color=c("red")))) +
theme_bw() +
labs(
x = "Ground Truth Probability of Superiority",
y = "Intervention Decision"
)
# create a table summarizing this worker
summary_table <- worker_df %>%
group_by(worker_id) %>%
summarise(
condition = unique(condition),
gender = unique(gender),
age = unique(age),
education = unique(education),
chart_use = unique(chart_use),
numeracy = unique(numeracy)
) %>%
select(-worker_id) %>%
ggtexttable(rows = NULL, theme = ttheme("blank"))
# stitch together these three views
charts <- ggarrange(p_sup_plt, decision_plt, ncol = 2, nrow = 1)
figure <- ggarrange(summary_table, charts, ncol = 1, nrow = 2)
print(figure)
}
How many participants meet our exclusion criteria? Here we use the practice trials as attention checks. If a user understands the decision problem, they should not intervene on either of the practice trials. We also look for participants who always responded the same thing on either question since these folks were probably speeding.
full_df %>%
# use practice trials as attention check
mutate(passed_check = (trial == "practice1" | trial == "practice2") & (intervene == 0)) %>%
# exclude if a worker has not passed the attention check on both practice trials
group_by(workerId, condition) %>%
summarise(
failed_attention_check = mean(passed_check) != 1,
speeding_p_sup = length(unique(pSup)) == 1,
speeding_intervene = length(unique(intervene)) == 1
) %>%
group_by(condition) %>%
summarise(
failed_attention_check = sum(failed_attention_check),
speeding_p_sup = sum(speeding_p_sup),
speeding_intervene = sum(speeding_intervene),
total = n()
)
## # A tibble: 3 x 5
## condition failed_attention_c… speeding_p_sup speeding_interve… total
## <chr> <int> <int> <int> <int>
## 1 HOPs 18 0 1 18
## 2 intervals_w_m… 19 0 2 19
## 3 means_only 19 0 0 19
It looks like most people are messing up at least on of the practice trials but very few people are speeding.
In an earlier version of the pilot, not everyone was shown all 24 levels of ground truth as intended. This is due to a bug in the interface code that reshuffled the trial set throughout the experiment for some participants. This was fixed after HIT assignment batch 7. Let’s check how many participants were shown various numbers of duplicate trials.
# create a grid of worker ids * trial indices, every trial that should exist
trials_should_exist_df <- data_grid(responses_df, worker_id = unique(worker_id), trial = unique(trial_idx))
# check the number of times each worker was shown each trial, and plot the number of workers shown various numbers of duplicates
responses_df %>% select(worker_id, trial_idx) %>%
right_join(trials_should_exist_df, by = "worker_id") %>%
group_by(worker_id, trial) %>%
summarise(n_times_shown_trial = sum(trial == trial_idx)) %>%
group_by(worker_id) %>%
summarise(n_duplicates = sum(n_times_shown_trial > 1)) %>%
ggplot(aes(x = n_duplicates)) +
geom_histogram(aes(y = ..count..), binwidth = 1, fill="black", col="grey") +
theme_bw() +
labs(
x = "Number of Duplicate Trials",
y = "Count of Participants"
)